Lab assignment: predicting car prices¶

In this assignment we will attempt to predict how car prices vary (usually going down) from first hand to second hand, according to different vehicle characteristics. We will exploit ensemble techniques to try to obtain as good models as possible.

Guidelines¶

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed. Both correctness of the solution and code quality will be taken into account for marking.
This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.
This is an advanced and voluntary excercise that can help you gain a deeper knowledge into the topic. This exercise won't be taken into account towards marking, but you are encouraged to undertake it. Good luck!

The following code imports the core libraries and objects used. Feel free to import other stuff if you need to in subsequent cells:

In [1]:
# Numpy and random seed
import numpy as np
RANDOM_STATE = 42; np.random.seed(RANDOM_STATE)
# Pandas with Holoviews
import holoviews as hv; hv.extension('bokeh')
from holoviews import opts
import hvplot.pandas  
import pandas as pd
pd.options.plotting.backend = 'hvplot'
# Inline Matlab plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Sklearn base classifiers
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
import warnings
warnings.filterwarnings('ignore')

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Caps+Shift to produce a pop-out with related documentation. This will only work inside code cells.

Let's go!

Data loading and preprocessing¶

In this assignment we will work with the vehicles dataset, which was retrieved from Craigslist:

In [2]:
#load the csv ignoring warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('./data/vehicles/vehicles.csv')
df
Out[2]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive type paint_color
0 6995 2000.0 gmc excellent 8 cylinders gas 167783.0 clean automatic 4wd NaN red
1 8750 2013.0 hyundai excellent 4 cylinders gas 90821.0 clean automatic fwd NaN grey
2 10900 2013.0 toyota good 4 cylinders hybrid 92800.0 clean automatic fwd NaN blue
3 12500 2003.0 mitsubishi good 4 cylinders gas NaN clean manual 4wd sedan grey
4 16995 2007.0 gmc good 8 cylinders diesel 254217.0 clean automatic 4wd truck white
... ... ... ... ... ... ... ... ... ... ... ... ...
423852 1600 2006.0 hyundai fair 6 cylinders gas 159980.0 clean automatic fwd sedan blue
423853 9000 2003.0 toyota excellent 8 cylinders gas 160000.0 clean automatic 4wd SUV green
423854 700 1994.0 ford fair 6 cylinders gas 212000.0 clean manual rwd NaN green
423855 3800 1999.0 lincoln excellent 8 cylinders gas 160000.0 clean automatic rwd sedan NaN
423856 8650 2015.0 nissan NaN NaN gas 160526.0 clean automatic fwd sedan silver

423857 rows × 12 columns

Let's take a look at the columns and their meaning:

  • price: the price (in dollars) the car is trying to be sold at.
  • year: year when that car was originally bought.
  • manufacturer: the car brand (Audi, BMW...).
  • condition: car status (like new, in good/bad condition...).
  • cylinders: how many cylinders the engine has.
  • fuel: what kind of fuel it uses.
  • odometer: mileage it has when sold.
  • title_status: if it's original or has been modified with extra pieces.
  • transmission: type of transmission (manual, automatic...).
  • drive: kind of drive (forward, rear...).
  • type: category of car (compact, SUV, coupe...).
  • paint_color: main color of the car (white, red...).

We've a faily large dataset (more than 400K rows), but there're quite a lot of missing values! The only column that is always there is price:

In [3]:
df.isna().sum(axis=0)
Out[3]:
price                0
year             95114
manufacturer    110615
condition       247138
cylinders       226178
fuel             96643
odometer        153272
title_status     96098
transmission     95792
drive           192738
type            182700
paint_color     201654
dtype: int64

To simplify things, let's simply eliminate all rows that have missing information. This leaves us with barely 86K samples:

In [4]:
df = df.dropna(how='any')
df = df.reset_index(drop=True)
df
Out[4]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive type paint_color
0 16995 2007.0 gmc good 8 cylinders diesel 254217.0 clean automatic 4wd truck white
1 13995 2012.0 ford good 6 cylinders gas 188406.0 clean automatic 4wd truck grey
2 7995 2010.0 chevrolet good 4 cylinders gas 108124.0 clean automatic 4wd SUV grey
3 8995 2011.0 chevrolet good 6 cylinders gas 178054.0 clean automatic 4wd SUV white
4 10995 2014.0 ford good 6 cylinders gas 170259.0 clean automatic 4wd SUV white
... ... ... ... ... ... ... ... ... ... ... ... ...
85993 9885 2012.0 subaru excellent 4 cylinders gas 82000.0 clean automatic 4wd hatchback silver
85994 750 2002.0 saturn salvage 4 cylinders gas 40000.0 missing manual fwd coupe blue
85995 4800 2002.0 ford good 6 cylinders gas 58000.0 clean automatic rwd coupe blue
85996 1600 2006.0 hyundai fair 6 cylinders gas 159980.0 clean automatic fwd sedan blue
85997 9000 2003.0 toyota excellent 8 cylinders gas 160000.0 clean automatic 4wd SUV green

85998 rows × 12 columns

Now we should also change year and odometer to integer type. As for cylinders, it can have the value 'other', but we can remove the word 'cylinders' itself:

In [5]:
df = df.astype({'price': 'int32', 'year': 'int16', 'odometer': 'int32'})
df['cylinders'] = df['cylinders'].where(df['cylinders'] == 'other', df['cylinders'].str.replace('cylinders', ''))
df
Out[5]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive type paint_color
0 16995 2007 gmc good 8 diesel 254217 clean automatic 4wd truck white
1 13995 2012 ford good 6 gas 188406 clean automatic 4wd truck grey
2 7995 2010 chevrolet good 4 gas 108124 clean automatic 4wd SUV grey
3 8995 2011 chevrolet good 6 gas 178054 clean automatic 4wd SUV white
4 10995 2014 ford good 6 gas 170259 clean automatic 4wd SUV white
... ... ... ... ... ... ... ... ... ... ... ... ...
85993 9885 2012 subaru excellent 4 gas 82000 clean automatic 4wd hatchback silver
85994 750 2002 saturn salvage 4 gas 40000 missing manual fwd coupe blue
85995 4800 2002 ford good 6 gas 58000 clean automatic rwd coupe blue
85996 1600 2006 hyundai fair 6 gas 159980 clean automatic fwd sedan blue
85997 9000 2003 toyota excellent 8 gas 160000 clean automatic 4wd SUV green

85998 rows × 12 columns

As for the numerical columns, note that there are some extreme values:

  • price: it shouldn't be 0 (free cars?), but of course it shouldn't be either 2.49 billion $!
  • year: there're really ancient cars (1923), and even a car from the future! (from 2021, perhaps a Delorean?)
  • odometer: maybe it can be 0 for a brand-new car, but 10 million miles?
In [6]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
#To facilitate the display of the data, I remove the scientific notation format.
df.describe(percentiles=[0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
Out[6]:
price year odometer
count 85998.000 85998.000 85998.000
mean -7240.961 2009.090 113764.810
std 6167590.244 8.667 124175.071
min -1804435921.000 1923.000 0.000
1% 0.000 1971.000 110.000
5% 600.000 1996.000 13077.000
10% 2495.000 2001.000 28000.000
25% 4995.000 2006.000 64491.750
50% 8995.000 2011.000 107960.500
75% 16000.000 2015.000 151217.000
90% 26873.000 2017.000 191600.000
95% 34750.000 2018.000 218990.600
99% 49500.000 2019.000 289003.090
max 123456789.000 2021.000 10000000.000

So let's do the following:

Filter out those car samples for which:
  • `price` is less than 500 or more than 50000 dollars.
  • `year` is less than 1990 or more than 2020.
  • `odometer` is exactly 0 or more than 300000 miles.
In [7]:
#filtering the dataset
df = df.drop(df[(df['price']<500) | (df['price']>50000)].index)
df = df.drop(df[(df['year']<1990) | (df['year']>2020)].index)
df = df.drop(df[(df['odometer']==0) | (df['odometer']>300000)].index)
df
Out[7]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive type paint_color
0 16995 2007 gmc good 8 diesel 254217 clean automatic 4wd truck white
1 13995 2012 ford good 6 gas 188406 clean automatic 4wd truck grey
2 7995 2010 chevrolet good 4 gas 108124 clean automatic 4wd SUV grey
3 8995 2011 chevrolet good 6 gas 178054 clean automatic 4wd SUV white
4 10995 2014 ford good 6 gas 170259 clean automatic 4wd SUV white
... ... ... ... ... ... ... ... ... ... ... ... ...
85993 9885 2012 subaru excellent 4 gas 82000 clean automatic 4wd hatchback silver
85994 750 2002 saturn salvage 4 gas 40000 missing manual fwd coupe blue
85995 4800 2002 ford good 6 gas 58000 clean automatic rwd coupe blue
85996 1600 2006 hyundai fair 6 gas 159980 clean automatic fwd sedan blue
85997 9000 2003 toyota excellent 8 gas 160000 clean automatic 4wd SUV green

77880 rows × 12 columns

In [8]:
df.describe()
Out[8]:
price year odometer
count 77880.000 77880.000 77880.000
mean 12425.825 2009.956 112367.979
std 9562.850 5.791 60041.653
min 500.000 1990.000 1.000
25% 5500.000 2006.000 68562.500
50% 9500.000 2011.000 110787.500
75% 16500.000 2014.000 153000.000
max 50000.000 2020.000 300000.000
In [9]:
#now there are no missing values
df.isna().sum(axis=0)
Out[9]:
price           0
year            0
manufacturer    0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
drive           0
type            0
paint_color     0
dtype: int64

Now we obtain reasonable distributions on price and year, except that there aren't many cars for very recent years (which makes sense, as those are mostly second hand cars, and in order to be second hand some years must pass!):

In [10]:
(
    df['price'].plot(kind='hist', bins=50, bin_range=(0, 50000)).opts(title='cars by price', xlabel='Price') +
    df['year'].plot(kind='hist', bins=30, bin_range=(1990, 2020)).opts(title='cars by year', xlabel='Year') +
    df['odometer'].plot(kind='hist', bins=30, bin_range=(0, 300000)).opts(title='cars by odometer', xlabel='Odometer')
).cols(1)

#As we can see, the graphs show quite proportionate and realistic results. In the "cars per year" graph, we can observe a dip in 2010 due to the banking crisis in the United States.
#That same graph shows a downward trend towards the end due to the transition between new and used cars, hence they will appear later in the market. It's not an anomaly, it's expected. 
#The "cars by odometer" graph is also quite reasonable, as the most sold cars have an average of 100,000 miles. As the miles increase, sales decrease, as most will end up in the scrapyard.
Out[10]:
Verify that the rest of features (all categorical) have reasonable values, and check which categories are allowed for each of those features.
In [11]:
#To verify that the remaining features (all categorical) have reasonable values, I have thought of creating a bar chart for visualization.
#For each feature, I will need to convert the categorical values to numerical ones,
#so I create a variable called "feature_counts" to group the number of records for each subcategory. Then, I create a bar chart using Holoviews.
manufacturer_counts = df['manufacturer'].value_counts().reset_index()
manufacturer_counts.columns = ['manufacturer', 'count']

bars_manufacturer = hv.Bars(manufacturer_counts, ['manufacturer'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_manufacturer
#As we can see, the best-selling brands are American ones (Chevrolet, Ford, Jeep), followed by other well-known brands like Toyota, Honda (Japan), and Hyundai (Korea).
#High-end brands have fewer records, such as Porsche,Aston Martin or Tesla, either because they sell fewer vehicles or because they are less commonly purchased as used cars.
Out[11]:
In [12]:
condition_counts = df['condition'].value_counts().reset_index()
condition_counts.columns = ['condition', 'count']
bars_condition = hv.Bars(condition_counts, ['condition'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_condition
#As we can see, the majority of cars are in excellent or good condition. 
#This may be because cars in good condition have more chances of being sold, or it could be a marketing strategy.
Out[12]:
In [13]:
cylinders_counts = df['cylinders'].value_counts().reset_index()
cylinders_counts.columns = ['cylinders', 'count']
bars_cylinders = hv.Bars(cylinders_counts, ['cylinders'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_cylinders
#Reasonable values for the number of cars given their displacement.
Out[13]:
In [14]:
fuel_counts = df['fuel'].value_counts().reset_index()
fuel_counts.columns = ['fuel', 'count']
bars_fuel = hv.Bars(fuel_counts, ['fuel'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_fuel
#The majority of cars are gasoline-powered.
Out[14]:
In [15]:
title_status_counts = df['title_status'].value_counts().reset_index()
title_status_counts.columns = ['title_status', 'count']
bars_title_status = hv.Bars(title_status_counts, ['title_status'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_title_status
#Reasonable values
Out[15]:
In [16]:
transmission_counts = df['transmission'].value_counts().reset_index()
transmission_counts.columns = ['transmission', 'count']
bars_transmission = hv.Bars(transmission_counts, ['transmission'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_transmission
#In the US, the most common is that cars have an automatic transmission.
Out[16]:
In [17]:
drive_counts = df['drive'].value_counts().reset_index()
drive_counts.columns = ['drive', 'count']
bars_drive = hv.Bars(drive_counts, ['drive'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_drive
#As we can see, rear-wheel drive is a very common type. Four-wheel drive also predominates for off-road vehicles, SUVs, and since the dataset also includes sales of trucks, vans, etc. 
#Lastly, rear-wheel drive is less common as it is typically reserved for high-end sports vehicles.
Out[17]:
In [18]:
ax=df.groupby(['drive', 'condition']).price.count().unstack().plot(kind = 'bar',figsize =(10, 7),
                                        title = 'Número de coches por tracción, según condición', rot=45,
                                        xlabel = 'Estado del coche',
                                        ylabel = 'Número de coches')
ax
WARNING:param.main: hvPlot does not have the concept of a figure, and the figsize keyword will be ignored. The size of each subplot in a layout is set individually using the width and height options.
Out[18]:
In [19]:
type_counts = df['type'].value_counts().reset_index()
type_counts.columns = ['type', 'count']
bars_type = hv.Bars(type_counts, ['type'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_type
Out[19]:
In [20]:
paint_color_counts = df['paint_color'].value_counts().reset_index()
paint_color_counts.columns = ['paint_color', 'count']
bars_paint_color = hv.Bars(paint_color_counts, ['paint_color'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))

bars_paint_color
Out[20]:

Finally, for all those categorical features we need to generate dummy variables (one-hot encoding), so that all features can be treated by models as purely numerical:

In [21]:
#One-hot encoding
df = pd.get_dummies(df)
df
Out[21]:
price year odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac ... paint_color_brown paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow
0 16995 2007 254217 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
1 13995 2012 188406 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 7995 2010 108124 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 8995 2011 178054 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 10995 2014 170259 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85993 9885 2012 82000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
85994 750 2002 40000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85995 4800 2002 58000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85996 1600 2006 159980 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85997 9000 2003 160000 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

77880 rows × 98 columns

So that our final dataset has roughly 78K patterns with 100 columns each.

Base models¶

A well-known empirical result (https://www.free-online-calculator-use.com/car-depreciation-calculator.html) is the following:

  • A car loses approximately a 24% of its value during the 1st year.
  • From that year on, it loses approximately a 15% of its value each year.

This yields the following exponentially decreasing curve for a 30,000 $ car:

In order to check this out in our data, we need to switch from the year column to an age column:

Calculate the `age` column (using 2020 as the current year), add it to the data and delete the `year` column.
In [22]:
#I calculate the age of the cars from their date and we substitute the variable
df['year'] = 2020 - df['year']
df = df.rename(columns={'year': 'age'})
df
Out[22]:
price age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac ... paint_color_brown paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow
0 16995 13 254217 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
1 13995 8 188406 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 7995 10 108124 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 8995 9 178054 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 10995 6 170259 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85993 9885 8 82000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
85994 750 18 40000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85995 4800 18 58000 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85996 1600 14 159980 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
85997 9000 17 160000 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

77880 rows × 98 columns

Note that we don't know the value of a car when it was purchased, but just the price it's being sold with, so we can't calculate its depreciation. However, ignoring inflation effects and assuming that cars worth the same value are purchased each year, we can approximate the above result if we group by age and calculate the average price of all cars with that age:

Calculate the mean price for each `age`, plot the results and calculate the % of decrease for each consecutive year. Do you obtain a similar result to the above one?
In [23]:
#I group by age and calculate the average price by age
avg_price_by_age = df.groupby('age')['price'].mean()
avg_price_by_age
Out[23]:
age
0    30033.957
1    28542.660
2    25884.679
3    22532.809
4    19179.775
5    18027.772
6    14848.909
7    13234.891
8    12040.311
9    11586.636
10    9387.075
11    8331.043
12    8741.302
13    7490.714
14    7435.246
15    6497.995
16    6310.352
17    5739.641
18    5820.298
19    5513.692
20    5444.955
21    5528.579
22    4920.148
23    5619.883
24    6006.901
25    6739.635
26    6195.187
27    6788.285
28    5825.006
29    6799.551
30    6674.127
Name: price, dtype: float64
In [24]:
plt.plot(avg_price_by_age)

plt.xlabel('Age')
plt.ylabel('Average Price')
plt.title('Average Price by Age')
plt.show()
In [25]:
# Create a DataFrame from the String 'avg_price_by_age'
df_avg_price_by_age = avg_price_by_age.reset_index()

# Generate the bar graph
plot = df_avg_price_by_age.hvplot.line('age', 'price', width=600, height=400)
plot 
Out[25]:
In [26]:
# Depreciación por año = (Precio de adquisición)/ Edad del coche)
df_edad_precio = df[['age', 'price']].copy()
df_edad_precio = df_edad_precio.groupby(['age']).mean().reset_index()
df_edad_precio["Depreciación"] = df_edad_precio ['price'].pct_change()
df_edad_precio["Depreciación"] = df_edad_precio ["Depreciación"] * 100
#the first row will have Nan value because it is the first year and depreciation cannot be calculated
df_edad_precio
Out[26]:
age price Depreciación
0 0 30033.957 NaN
1 1 28542.660 -4.965
2 2 25884.679 -9.312
3 3 22532.809 -12.949
4 4 19179.775 -14.881
5 5 18027.772 -6.006
6 6 14848.909 -17.633
7 7 13234.891 -10.870
8 8 12040.311 -9.026
9 9 11586.636 -3.768
10 10 9387.075 -18.984
11 11 8331.043 -11.250
12 12 8741.302 4.924
13 13 7490.714 -14.307
14 14 7435.246 -0.740
15 15 6497.995 -12.606
16 16 6310.352 -2.888
17 17 5739.641 -9.044
18 18 5820.298 1.405
19 19 5513.692 -5.268
20 20 5444.955 -1.247
21 21 5528.579 1.536
22 22 4920.148 -11.005
23 23 5619.883 14.222
24 24 6006.901 6.887
25 25 6739.635 12.198
26 26 6195.187 -8.078
27 27 6788.285 9.574
28 28 5825.006 -14.190
29 29 6799.551 16.730
30 30 6674.127 -1.845
In [27]:
#we leave the rows until the year 20 to be able to compare with the graph above
df_edad_precio = df_edad_precio.drop(range(21, 31),axis=0)
df_edad_precio
Out[27]:
age price Depreciación
0 0 30033.957 NaN
1 1 28542.660 -4.965
2 2 25884.679 -9.312
3 3 22532.809 -12.949
4 4 19179.775 -14.881
5 5 18027.772 -6.006
6 6 14848.909 -17.633
7 7 13234.891 -10.870
8 8 12040.311 -9.026
9 9 11586.636 -3.768
10 10 9387.075 -18.984
11 11 8331.043 -11.250
12 12 8741.302 4.924
13 13 7490.714 -14.307
14 14 7435.246 -0.740
15 15 6497.995 -12.606
16 16 6310.352 -2.888
17 17 5739.641 -9.044
18 18 5820.298 1.405
19 19 5513.692 -5.268
20 20 5444.955 -1.247
In [28]:
media_df_edad_precio = df_edad_precio.drop(range(0, 1),axis=0)
media_df_edad_precio
#we remove the year 0
Out[28]:
age price Depreciación
1 1 28542.660 -4.965
2 2 25884.679 -9.312
3 3 22532.809 -12.949
4 4 19179.775 -14.881
5 5 18027.772 -6.006
6 6 14848.909 -17.633
7 7 13234.891 -10.870
8 8 12040.311 -9.026
9 9 11586.636 -3.768
10 10 9387.075 -18.984
11 11 8331.043 -11.250
12 12 8741.302 4.924
13 13 7490.714 -14.307
14 14 7435.246 -0.740
15 15 6497.995 -12.606
16 16 6310.352 -2.888
17 17 5739.641 -9.044
18 18 5820.298 1.405
19 19 5513.692 -5.268
20 20 5444.955 -1.247
In [29]:
media_df_edad_precio['Depreciación'].mean()
Out[29]:
-7.970666440655506
In [30]:
df_edad_precio.groupby(['age'])['price'].mean().plot(color = "#FF3364",
                                         fontsize = 12,
                                         title = 'Depreciacion vs años',
                                         xlabel = 'Años',
                                         ylabel = 'Precio medio'
                                        )
#The results aren't quite the same. In the first year, there's roughly a 5% depreciation, which contrasts with the 24% shown in the first graph. As for the remaining years, there's approximately an 8% depreciation, 
#which is significantly less than the 15% observed in the initial study."
Out[30]:

Since age is quite correlated with odometer (the older a car is, the more mileage it tends to have), the above can be repeated with odometer:

Calculate the mean price for `odometer` (group by multiples of 10000 miles), plot the results and calculate the % of decrease for each consecutive group. Do you obtain a similar depreciation curve?
As Pandas documentation clarifies, in order to group by a function on a column, you need to set first that column as the index with `set_index`.
In [31]:
#I define the 'odometer' intervals
bins = pd.interval_range(start=0, end=300000, freq=10000)
# Now I can group by the intervals of 'odometer' and calculate the average of 'price' in each group
avg_price_by_odometer = df.groupby(pd.cut(df['odometer'], bins=bins))['price'].mean().reset_index()
avg_price_by_odometer
Out[31]:
odometer price
0 (0, 10000] 20135.979
1 (10000, 20000] 27016.921
2 (20000, 30000] 24424.419
3 (30000, 40000] 22328.630
4 (40000, 50000] 19585.644
5 (50000, 60000] 17855.372
6 (60000, 70000] 16129.750
7 (70000, 80000] 14347.963
8 (80000, 90000] 13374.957
9 (90000, 100000] 12377.919
10 (100000, 110000] 11225.052
11 (110000, 120000] 10032.653
12 (120000, 130000] 9740.624
13 (130000, 140000] 8696.668
14 (140000, 150000] 8157.865
15 (150000, 160000] 7782.058
16 (160000, 170000] 7687.409
17 (170000, 180000] 6990.205
18 (180000, 190000] 7421.874
19 (190000, 200000] 6474.934
20 (200000, 210000] 6445.087
21 (210000, 220000] 5699.513
22 (220000, 230000] 5847.221
23 (230000, 240000] 6442.864
24 (240000, 250000] 6098.073
25 (250000, 260000] 5914.945
26 (260000, 270000] 6985.744
27 (270000, 280000] 7474.760
28 (280000, 290000] 5290.119
29 (290000, 300000] 5959.919
In [32]:
type(avg_price_by_odometer['odometer'])
Out[32]:
pandas.core.series.Series
In [33]:
plt.plot(avg_price_by_odometer.index,avg_price_by_odometer['price'])
plt.xlabel('Odometer Bins')
plt.ylabel('Average Price')
plt.title('Average Price by Odometer')
plt.show()
In [34]:
# Calculate the change in average price from one interval to the next. This will be the depreciation in the price for each interval.
avg_price_by_odometer['depreciation_per_interval'] = avg_price_by_odometer['price'].diff()

# Calculate depreciation in percentage terms
avg_price_by_odometer['depreciation_percentage'] = avg_price_by_odometer['depreciation_per_interval'] / avg_price_by_odometer['price'].shift() * 100

avg_price_by_odometer
Out[34]:
odometer price depreciation_per_interval depreciation_percentage
0 (0, 10000] 20135.979 NaN NaN
1 (10000, 20000] 27016.921 6880.942 34.172
2 (20000, 30000] 24424.419 -2592.503 -9.596
3 (30000, 40000] 22328.630 -2095.789 -8.581
4 (40000, 50000] 19585.644 -2742.986 -12.285
5 (50000, 60000] 17855.372 -1730.272 -8.834
6 (60000, 70000] 16129.750 -1725.621 -9.664
7 (70000, 80000] 14347.963 -1781.788 -11.047
8 (80000, 90000] 13374.957 -973.005 -6.781
9 (90000, 100000] 12377.919 -997.039 -7.455
10 (100000, 110000] 11225.052 -1152.867 -9.314
11 (110000, 120000] 10032.653 -1192.399 -10.623
12 (120000, 130000] 9740.624 -292.028 -2.911
13 (130000, 140000] 8696.668 -1043.956 -10.718
14 (140000, 150000] 8157.865 -538.804 -6.196
15 (150000, 160000] 7782.058 -375.806 -4.607
16 (160000, 170000] 7687.409 -94.649 -1.216
17 (170000, 180000] 6990.205 -697.204 -9.069
18 (180000, 190000] 7421.874 431.669 6.175
19 (190000, 200000] 6474.934 -946.940 -12.759
20 (200000, 210000] 6445.087 -29.847 -0.461
21 (210000, 220000] 5699.513 -745.574 -11.568
22 (220000, 230000] 5847.221 147.708 2.592
23 (230000, 240000] 6442.864 595.643 10.187
24 (240000, 250000] 6098.073 -344.791 -5.352
25 (250000, 260000] 5914.945 -183.128 -3.003
26 (260000, 270000] 6985.744 1070.799 18.103
27 (270000, 280000] 7474.760 489.017 7.000
28 (280000, 290000] 5290.119 -2184.641 -29.227
29 (290000, 300000] 5959.919 669.801 12.661
In [35]:
#I remove the first row
#The first year has no depreciation because there is no prior year.
avg_price_by_odometer = avg_price_by_odometer.drop(range(0, 1),axis=0)
avg_price_by_odometer
Out[35]:
odometer price depreciation_per_interval depreciation_percentage
1 (10000, 20000] 27016.921 6880.942 34.172
2 (20000, 30000] 24424.419 -2592.503 -9.596
3 (30000, 40000] 22328.630 -2095.789 -8.581
4 (40000, 50000] 19585.644 -2742.986 -12.285
5 (50000, 60000] 17855.372 -1730.272 -8.834
6 (60000, 70000] 16129.750 -1725.621 -9.664
7 (70000, 80000] 14347.963 -1781.788 -11.047
8 (80000, 90000] 13374.957 -973.005 -6.781
9 (90000, 100000] 12377.919 -997.039 -7.455
10 (100000, 110000] 11225.052 -1152.867 -9.314
11 (110000, 120000] 10032.653 -1192.399 -10.623
12 (120000, 130000] 9740.624 -292.028 -2.911
13 (130000, 140000] 8696.668 -1043.956 -10.718
14 (140000, 150000] 8157.865 -538.804 -6.196
15 (150000, 160000] 7782.058 -375.806 -4.607
16 (160000, 170000] 7687.409 -94.649 -1.216
17 (170000, 180000] 6990.205 -697.204 -9.069
18 (180000, 190000] 7421.874 431.669 6.175
19 (190000, 200000] 6474.934 -946.940 -12.759
20 (200000, 210000] 6445.087 -29.847 -0.461
21 (210000, 220000] 5699.513 -745.574 -11.568
22 (220000, 230000] 5847.221 147.708 2.592
23 (230000, 240000] 6442.864 595.643 10.187
24 (240000, 250000] 6098.073 -344.791 -5.352
25 (250000, 260000] 5914.945 -183.128 -3.003
26 (260000, 270000] 6985.744 1070.799 18.103
27 (270000, 280000] 7474.760 489.017 7.000
28 (280000, 290000] 5290.119 -2184.641 -29.227
29 (290000, 300000] 5959.919 669.801 12.661
In [36]:
avg_price_by_odometer_depr =avg_price_by_odometer.drop(range(16, 30),axis=0)
avg_price_by_odometer_depr
Out[36]:
odometer price depreciation_per_interval depreciation_percentage
1 (10000, 20000] 27016.921 6880.942 34.172
2 (20000, 30000] 24424.419 -2592.503 -9.596
3 (30000, 40000] 22328.630 -2095.789 -8.581
4 (40000, 50000] 19585.644 -2742.986 -12.285
5 (50000, 60000] 17855.372 -1730.272 -8.834
6 (60000, 70000] 16129.750 -1725.621 -9.664
7 (70000, 80000] 14347.963 -1781.788 -11.047
8 (80000, 90000] 13374.957 -973.005 -6.781
9 (90000, 100000] 12377.919 -997.039 -7.455
10 (100000, 110000] 11225.052 -1152.867 -9.314
11 (110000, 120000] 10032.653 -1192.399 -10.623
12 (120000, 130000] 9740.624 -292.028 -2.911
13 (130000, 140000] 8696.668 -1043.956 -10.718
14 (140000, 150000] 8157.865 -538.804 -6.196
15 (150000, 160000] 7782.058 -375.806 -4.607
In [37]:
avg_price_by_odometer_depr['depreciation_percentage'].mean()
#The average depreciation for the remaining years differs slightly from the previous one.
Out[37]:
-5.6291535703642115

Fair enough, as this confirms that our clean data follow the expected trend. Once we rely on our data, it's time to split them into training and testing sets:

Divide `df` into: * `X` (information of cars excluding `price`). * `Y` (`price` of those cars). Use 80% of cars for training, and the remaining 20% for testing, **Remember to use `RANDOM_STATE`** so that this division is always the same, no matter how many times you rerun your notebook.
In [38]:
#I take all the columns except price, which will be the independent variables
df = df.reset_index(drop=True)
x = df.loc[:, df.columns != 'price']
# I take the price column that will be my target variable
y = df.loc[:, df.columns == 'price']
In [39]:
x
Out[39]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_brown paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow
0 13 254217 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
1 8 188406 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 10 108124 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
3 9 178054 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
4 6 170259 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
77875 8 82000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
77876 18 40000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
77877 18 58000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
77878 14 159980 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
77879 17 160000 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

77880 rows × 97 columns

In [40]:
y
Out[40]:
price
0 16995
1 13995
2 7995
3 8995
4 10995
... ...
77875 9885
77876 750
77877 4800
77878 1600
77879 9000

77880 rows × 1 columns

In [41]:
train, test = train_test_split(df, 
                               train_size = 0.80, 
                               random_state = 42) 
#I do a train-test split and set the seed
In [42]:
target = 'price'
indp_columns = [x for x in df.columns if x !='price']
In [43]:
x_train = train[indp_columns]
y_train = train[target]
In [44]:
x_test = test[indp_columns]
y_test = test[target]
In [45]:
x_train
Out[45]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_brown paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow
29645 4 72500 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
68355 17 136000 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
52304 8 109526 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
62400 14 184850 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
11259 4 70000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6265 2 58457 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
54886 15 215000 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 1 0 0
76820 13 138600 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
860 7 80000 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
15795 3 53171 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

62304 rows × 97 columns

In [46]:
y_train
Out[46]:
29645    24495
68355     6500
52304     7300
62400     3995
11259    31900
         ...  
6265     12900
54886     8995
76820     3300
860       8299
15795    40990
Name: price, Length: 62304, dtype: int32

So that now we can train a very basic model:

Use `age` to train a linear regression model, that is, find values for `w` and `b` such that the estimated price is $\hat{y} = w·age + b$. Which values do you obtain for `w` and `b`? Can you explain in plain words what this model is doing? Do you see any obvious problem if we wanted to use this model in practice?
Check the documentation of LinearRegression to know how to obtain `w` and `b`.
In [47]:
#linear regression
lr =  LinearRegression(fit_intercept=True) 

# We call the fit method to train the regression
lr.fit (x_train [['age']], y_train)
Out[47]:
LinearRegression()
In [48]:
print("Intercept: " + str(lr.intercept_))
print("(w): " + str(lr.coef_))
#y= -962.26429718.age + 22095.424788176766
#w:decreasing line
Intercept: 22095.42478817674
(w): [-962.26429718]
In [49]:
año_precio = train.groupby(['age'])['price'].mean()
año_precio = año_precio.values
train_Pmedio = pd.DataFrame()
train_Pmedio['Edad'] = np.arange(0, 31)
train_Pmedio['Precio'] = año_precio
In [50]:
x_train_Pmedio = train_Pmedio['Edad'].values
y_train_Pmedio = train_Pmedio['Precio'].values
In [51]:
w = -962.26429718
b = 22095.424788176766
In [52]:
plt.figure(figsize=(8,6))
plt.scatter(x_train_Pmedio, y_train_Pmedio, c='#33FFB2', s=60, label='Valores reales')
plt.plot ( x_train_Pmedio, w*x_train_Pmedio + b, color='#FF3352', label="Regresión Lineal")
plt.title("Regresión Lineal vs Valores reales", fontsize = 16)
plt.xlabel("Age", fontsize = 10)
plt.ylabel("Price", fontsize = 10)
plt.legend();
#Linear regression plot vs actual values

And check it out in test:

Use the linear regression model to predict for `X_test` cars. What score do you obtain? How does this compare to `X_train` cars? Why is this?
A good way to grasp what is going on is to plot with two different line plots: * What the model says for a given `age`. * The average price in the dataset for that `age`. Do this either for `X_train` and `X_test` separately, or for the whole dataset at once.
In [53]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
#Metrics
mse = mean_squared_error(y_true = y_test, y_pred = lr.predict(x_test[["age"]]).reshape(-1, 1))

#MSE
rmse = np.sqrt(mse)

#MAE
mae = mean_absolute_error(y_true = y_test, y_pred = lr.predict(x_test[["age"]]).reshape(-1, 1))

print ("Train score (R2): ", lr.score(x_train [['age']], y_train))
print("MSE test: " + str(mse))
print("RMSE test: " +  str(rmse))
print("MAE test: " +  str(mae))
Train score (R2):  0.3399370598042252
MSE test: 60487159.02335366
RMSE test: 7777.349100005326
MAE test: 5876.8384209432525
In [54]:
prediccion = pd.DataFrame({'age': np.arange(0,31)})
prediccion['price'] = lr.predict(prediccion[['age']])
In [55]:
linea1 = df.groupby(['age'])['price'].mean().plot (x = 'age', y = 'price', color = "#FF3333",
                                         fontsize = 12,
                                         title = 'prediction vs average price by year',
                                         xlabel = 'Edad',
                                         ylabel = 'Precio')
linea2 = prediccion.plot (x = 'age', y = 'price', kind = 'line', color = "#93FF33") 
linea1 * linea2
#the prediction is not very good, especially at the end of the line, where it predicts negative prices
Out[55]:

Let's repeat this with odometer:

Use `odometer` to train another linear regression model, that is, find values for `w` and `b` such that the estimated price is $\hat{y} = w·odometer + b$. Which values do you obtain now for `w` and `b`? Interpret the resulting model, and see if it behaves significantly better/worse for `X_test`. Why is this if `odometer` is very correlated with `age`?
In [56]:
#Linear regression odometer
#independent variables
x = df.loc[:, df.columns != 'price']

# target: dependent variable
y = df.loc[:, df.columns == 'price']
In [57]:
target = 'price'
indp_columns = [x for x in df.columns if x !='price']

x_train = train[indp_columns]
y_train = train[target]
In [58]:
lr_millas =  LinearRegression(fit_intercept=True)
In [59]:
lr_millas.fit (x_train [['odometer']], y_train)
Out[59]:
LinearRegression()
In [60]:
print("Intercept " + str(lr_millas.intercept_))
print("(w): " + str(lr_millas.coef_))
#Intercept: 22022.01
#w: -0-085
#A decreasing linear function. The cars depreciate by an additional $0.085 per mile.
Intercept 22022.011181975868
(w): [-0.08524899]

One of the problems with the above linear regressions is that we are taking into account just one variable (either age or odometer) to predict. But there are many more features in the dataset, including all the categorical ones. Let's fit a tree to see what happens:

Train a decision tree with all features (use `max_depth=5` to try to prevent overfitting) and store it in a variable called `dt`. Do you obtain better predictions now in test?
In [61]:
target = 'price'
indp_columns = [x for x in df.columns if x !='price']

X_train = train[indp_columns]
y_train = train[target]

X_test = test[indp_columns]
y_test = test[target]
In [62]:
# Decission tree Regressor
dt = DecisionTreeRegressor(criterion="mse", 
                                 max_depth=5,
                                 min_samples_split=5,
                                 min_samples_leaf=5
                                )

# fit:
dt.fit(X_train, y_train)

# metrics:
mse_dt = mean_squared_error(y_true = y_test, y_pred = dt.predict(X_test))
rmse_dt = np.sqrt(mse_dt)

print("MSE: " + str(mse_dt))
print("RMSE: " +  str(rmse_dt))
#I manage to reduce the error by approximately 2000 dollars
#the results are better than in linear regression
MSE: 28367308.870631143
RMSE: 5326.0969640658195

One of the advantages of decision trees is that their nodes can be plot to see how the tree is deciding to split. This can be done using the plot_tree utility function. For example, the first levels can be shown as follows:

In [63]:
_, ax = plt.subplots(figsize=(20, 6))
plot_tree(dt, max_depth=2, ax=ax, feature_names=X_train.columns, impurity=False, fontsize=16);

Which shows that depending on the branch the price prediction (value) differs notably. The features that are selected most times to split are reflected in the feature_importances_ field:

Plot the importance of the 10 most important features. Which are those features? Does this make sense according to your intuition?
You may find the `argsort` function useful when trying to obtain the most important features. Once you have them, you can use a standard bar plot with feature names as `x` axis and importances as `y` axis.
In [64]:
dt.feature_importances_
Out[64]:
array([0.59090243, 0.05797833, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00101937,
       0.15162408, 0.        , 0.01146445, 0.02964704, 0.        ,
       0.04182367, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.10835136, 0.00535401, 0.00183525, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])
In [65]:
orden = dt.feature_importances_.argsort()
orden
Out[65]:
array([48, 68, 67, 66, 65, 64, 63, 62, 61, 69, 60, 58, 57, 56, 54, 51, 95,
       47, 46, 59, 73, 74, 75, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84,
       83, 82, 81, 80, 79, 78, 77, 76, 45, 44, 96, 42, 43, 22, 21, 20, 19,
       18, 17, 16, 15,  3, 14, 12, 11, 10,  9,  8,  7,  6,  5,  4, 13, 24,
       23, 26, 41, 40, 39, 38, 37, 25, 35, 34, 36, 32, 31, 30, 29, 28, 27,
       33,  2, 49, 72, 71, 52, 53, 55,  1, 70, 50,  0], dtype=int64)
In [66]:
features_importances = pd.DataFrame(columns = X_train.columns[orden])
features_importances
Out[66]:
cylinders_12 transmission_other transmission_manual transmission_automatic title_status_salvage title_status_rebuilt title_status_parts only title_status_missing title_status_lien drive_4wd ... cylinders_3 type_SUV drive_rwd cylinders_6 cylinders_8 fuel_diesel odometer drive_fwd cylinders_4 age

0 rows × 97 columns

In [67]:
features_importances.loc[0]=dt.feature_importances_[orden]
features_importances
Out[67]:
cylinders_12 transmission_other transmission_manual transmission_automatic title_status_salvage title_status_rebuilt title_status_parts only title_status_missing title_status_lien drive_4wd ... cylinders_3 type_SUV drive_rwd cylinders_6 cylinders_8 fuel_diesel odometer drive_fwd cylinders_4 age
0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.001 0.002 0.005 0.011 0.030 0.042 0.058 0.108 0.152 0.591

1 rows × 97 columns

In [68]:
Top_fi = features_importances.drop(features_importances.columns[0: 87], axis='columns')
Top_fi
Out[68]:
cylinders_3 type_SUV drive_rwd cylinders_6 cylinders_8 fuel_diesel odometer drive_fwd cylinders_4 age
0 0.001 0.002 0.005 0.011 0.030 0.042 0.058 0.108 0.152 0.591
In [69]:
Top_fi.plot.bar(fontsize = 12,
                title = 'top variables according to importance',
                xlabel = 'Decission tree',
                ylabel = 'importance')
#age, cilynders and transmission 
Out[69]:

Ensemble models¶

The model in dt should have convinced you that trees show predictive power for this dataset, but that it's not enough with a single tree. Time now to try tree ensembles:

Train tree ensembles on this dataset. Feel free to edit this notebook and to try different: * Techniques (bagging, boosting, ...). * Implementations (sklearn, xgb, ...). * Parameter selections (number of trees, depth of trees, ...). Explain your approach. What is the best model that you can come up with?
In [70]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor   
from catboost import CatBoostRegressor  
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
In [71]:
rf_reg = RandomForestRegressor(random_state = RANDOM_STATE)
extree_reg = ExtraTreesRegressor(random_state = RANDOM_STATE)
ab_reg = AdaBoostRegressor(random_state = RANDOM_STATE)
gb_reg = GradientBoostingRegressor(random_state = RANDOM_STATE)
lgb_reg = LGBMRegressor(random_state = RANDOM_STATE)
cb_reg = CatBoostRegressor(random_state = RANDOM_STATE)
xgb_reg = XGBRegressor(random_state = RANDOM_STATE)
In [72]:
modelos = [
    ('rf_reg', RandomForestRegressor()), ('extree_reg', ExtraTreesRegressor()), ('ab_reg', AdaBoostRegressor()), 
    ('gb_reg', GradientBoostingRegressor()), ('lgb_reg', LGBMRegressor()),
    ('cb_reg', CatBoostRegressor(), ('xgb_reg'), XGBRegressor())]
In [73]:
rf_reg = RandomForestRegressor (n_estimators = 300,
          max_depth = 30,
          min_samples_split = 10,
          ccp_alpha = 0.1,
          min_samples_leaf = 5)
                

rf_reg.fit(X_train, y_train)


print('score train: ', rf_reg.score (X_train, y_train))
print('score test: ', rf_reg.score (X_test, y_test))
#Overfitting
score train:  0.9261437802352984
score test:  0.8704634725386805
In [74]:
rf_reg2 = RandomForestRegressor(n_estimators = 150,
           max_depth = 30,
           max_features = "sqrt",
           ccp_alpha = 0.1,
           min_samples_split =  10,
           min_samples_leaf = 5
         )

rf_reg2.fit(X_train, y_train)


print('score train: ', rf_reg2.score (X_train, y_train))
print('score test: ', rf_reg2.score (X_test, y_test))
#Overfitting
score train:  0.8523821212629688
score test:  0.8308471404809338
In [75]:
rf_reg3 = RandomForestRegressor(n_estimators = 150,
           max_depth = 10,
           max_features = "sqrt",
           ccp_alpha = 0.1,
           min_samples_split = 10,
           min_samples_leaf = 5
         )

rf_reg3.fit(X_train, y_train)


print('score train: ', rf_reg3.score (X_train, y_train))
print('score test: ', rf_reg3.score (X_test, y_test))
score train:  0.7471418082581973
score test:  0.7375122425853795
In [76]:
extree_reg = ExtraTreesRegressor (n_estimators = 300,
          max_depth = 30,
          max_features = "sqrt",
          min_samples_split = 10,
          ccp_alpha = 0.1,
          min_samples_leaf = 5)
                

extree_reg.fit(X_train, y_train)


print('score train: ', extree_reg.score (X_train, y_train))
print('score test: ', extree_reg.score (X_test, y_test))
score train:  0.8020990473006532
score test:  0.7788793049512839
In [77]:
extree_reg2 = ExtraTreesRegressor(n_estimators = 150,
           max_depth = 30,
           max_features = "sqrt",
           ccp_alpha = 0.1,
           min_samples_split =  2,
           min_samples_leaf = 1
         )

extree_reg2.fit(X_train, y_train)


print('score train: ', extree_reg2.score (X_train, y_train))
print('score test: ', extree_reg2.score (X_test, y_test))
#I've reduced the number of trees but overfitting has increased
score train:  0.9760695664157354
score test:  0.8754761478170683
In [78]:
grid_ab_reg = {"n_estimators": [150, 300, 500],
               "learning_rate": [1.0] 
              }

gs_ab_reg = GridSearchCV(estimator = ab_reg,
                        param_grid = grid_ab_reg,
                        scoring    = 'r2',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)

gs_ab_reg.fit(X_train, y_train)

print('Mejor selección: ', gs_ab_reg.best_estimator_)
print('Score train: ', gs_ab_reg.best_score_)
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Mejor selección:  AdaBoostRegressor(n_estimators=150, random_state=42)
Score train:  0.5308523379539066
In [79]:
grid_gb_reg  = {"learning_rate": [0.1, 0.5],
                "n_estimators": [50,100],
                "min_samples_split": [1, 5],
                "min_samples_leaf": [2, 4],
                "max_depth": [1,5,9],
                }

# GridSearch: 
gs_gb_reg = GridSearchCV(estimator = gb_reg,
                        param_grid = grid_gb_reg,
                        scoring    = 'r2',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)

gs_gb_reg.fit(X_train, y_train)

print('Mejor selección: ', gs_gb_reg.best_estimator_)
print('Score train: ', gs_gb_reg.best_score_)
Fitting 10 folds for each of 48 candidates, totalling 480 fits
Mejor selección:  GradientBoostingRegressor(max_depth=9, min_samples_leaf=2, min_samples_split=5,
                          random_state=42)
Score train:  0.8826310919750598
In [80]:
print('score test: ', gs_gb_reg.score (X_test, y_test))
score test:  0.8806625226588694
In [81]:
grid_lgb_reg = {"num_leaves": [7, 14, 21, 28, 31, 50],
                "learning_rate": [0.1, 0.03],
                "n_estimators": [150,300],
                "max_depth": [-1,3,5,9,11],
                }

# GridSearch: 
gs_lgb_reg = GridSearchCV(estimator = lgb_reg,
                        param_grid = grid_lgb_reg,
                        scoring    = 'r2',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)

gs_lgb_reg.fit(X_train, y_train)

print('Mejor selección: ', gs_lgb_reg.best_estimator_)
print('Score train: ', gs_lgb_reg.best_score_)
Fitting 10 folds for each of 120 candidates, totalling 1200 fits
Mejor selección:  LGBMRegressor(n_estimators=300, num_leaves=50, random_state=42)
Score train:  0.8815811343287233
In [82]:
# Grid de hiperparámetros:
grid_cb_reg = {"iterations": [100, 200, 300],
               "learning_rate": [0.03, 0.1, 0.5],
               "depth": [3, 9],
               "random_strength": [0.2, 0.5, 1, 3]
                }

# GridSearch: 
gs_cb_reg = GridSearchCV(estimator = cb_reg,
                        param_grid = grid_cb_reg,
                        scoring    = 'r2',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)

gs_cb_reg.fit(X_train, y_train)

print('Mejor selección: ', gs_cb_reg.best_estimator_)
print('Score train: ', gs_cb_reg.best_score_)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
0:	learn: 6435.0575113	total: 166ms	remaining: 49.5s
1:	learn: 5194.1233271	total: 183ms	remaining: 27.3s
2:	learn: 4614.8593157	total: 200ms	remaining: 19.8s
3:	learn: 4354.2996171	total: 217ms	remaining: 16.1s
4:	learn: 4184.3281913	total: 234ms	remaining: 13.8s
5:	learn: 4090.6489786	total: 251ms	remaining: 12.3s
6:	learn: 4016.0343241	total: 267ms	remaining: 11.2s
7:	learn: 3976.2338440	total: 284ms	remaining: 10.4s
8:	learn: 3920.9469790	total: 303ms	remaining: 9.78s
9:	learn: 3869.1175106	total: 319ms	remaining: 9.24s
10:	learn: 3826.5859804	total: 338ms	remaining: 8.87s
11:	learn: 3793.6420583	total: 368ms	remaining: 8.84s
12:	learn: 3760.9819132	total: 402ms	remaining: 8.88s
13:	learn: 3741.1394924	total: 421ms	remaining: 8.6s
14:	learn: 3711.5111046	total: 444ms	remaining: 8.43s
15:	learn: 3692.8604979	total: 462ms	remaining: 8.21s
16:	learn: 3668.8791352	total: 480ms	remaining: 8s
17:	learn: 3652.5281991	total: 499ms	remaining: 7.82s
18:	learn: 3614.6105095	total: 518ms	remaining: 7.67s
19:	learn: 3596.9250794	total: 536ms	remaining: 7.51s
20:	learn: 3585.2377403	total: 555ms	remaining: 7.38s
21:	learn: 3569.7823145	total: 573ms	remaining: 7.24s
22:	learn: 3547.7422642	total: 594ms	remaining: 7.16s
23:	learn: 3534.3671556	total: 618ms	remaining: 7.11s
24:	learn: 3514.6886666	total: 640ms	remaining: 7.04s
25:	learn: 3495.8946048	total: 659ms	remaining: 6.94s
26:	learn: 3481.6069594	total: 677ms	remaining: 6.84s
27:	learn: 3470.2079406	total: 695ms	remaining: 6.75s
28:	learn: 3446.8092016	total: 713ms	remaining: 6.67s
29:	learn: 3427.0861584	total: 733ms	remaining: 6.6s
30:	learn: 3412.8999998	total: 757ms	remaining: 6.57s
31:	learn: 3405.2466042	total: 777ms	remaining: 6.51s
32:	learn: 3387.7019719	total: 799ms	remaining: 6.46s
33:	learn: 3377.5416207	total: 824ms	remaining: 6.45s
34:	learn: 3363.6473267	total: 848ms	remaining: 6.42s
35:	learn: 3352.5042625	total: 871ms	remaining: 6.39s
36:	learn: 3347.4901265	total: 890ms	remaining: 6.33s
37:	learn: 3336.4869666	total: 911ms	remaining: 6.28s
38:	learn: 3322.6411015	total: 931ms	remaining: 6.23s
39:	learn: 3312.8176748	total: 950ms	remaining: 6.18s
40:	learn: 3303.7794669	total: 967ms	remaining: 6.11s
41:	learn: 3292.9941103	total: 985ms	remaining: 6.05s
42:	learn: 3289.0305148	total: 1s	remaining: 5.99s
43:	learn: 3278.8806966	total: 1.03s	remaining: 5.97s
44:	learn: 3261.8821679	total: 1.05s	remaining: 5.96s
45:	learn: 3257.4616576	total: 1.07s	remaining: 5.92s
46:	learn: 3248.0766410	total: 1.09s	remaining: 5.87s
47:	learn: 3233.0625499	total: 1.11s	remaining: 5.82s
48:	learn: 3227.7758132	total: 1.13s	remaining: 5.78s
49:	learn: 3216.9585403	total: 1.15s	remaining: 5.75s
50:	learn: 3205.7267089	total: 1.17s	remaining: 5.71s
51:	learn: 3202.2572991	total: 1.19s	remaining: 5.66s
52:	learn: 3196.1935529	total: 1.21s	remaining: 5.62s
53:	learn: 3192.1681258	total: 1.23s	remaining: 5.58s
54:	learn: 3184.7819725	total: 1.25s	remaining: 5.55s
55:	learn: 3175.6385500	total: 1.27s	remaining: 5.54s
56:	learn: 3169.3637084	total: 1.29s	remaining: 5.51s
57:	learn: 3162.0195281	total: 1.31s	remaining: 5.48s
58:	learn: 3154.3393886	total: 1.33s	remaining: 5.44s
59:	learn: 3148.8559108	total: 1.35s	remaining: 5.4s
60:	learn: 3141.3330644	total: 1.37s	remaining: 5.37s
61:	learn: 3132.6465558	total: 1.39s	remaining: 5.33s
62:	learn: 3126.0904099	total: 1.41s	remaining: 5.3s
63:	learn: 3116.1633095	total: 1.42s	remaining: 5.25s
64:	learn: 3108.5382671	total: 1.44s	remaining: 5.22s
65:	learn: 3099.8123254	total: 1.47s	remaining: 5.2s
66:	learn: 3092.0807055	total: 1.49s	remaining: 5.19s
67:	learn: 3084.7899055	total: 1.51s	remaining: 5.16s
68:	learn: 3077.1775934	total: 1.53s	remaining: 5.13s
69:	learn: 3070.6885755	total: 1.55s	remaining: 5.1s
70:	learn: 3062.8860598	total: 1.57s	remaining: 5.07s
71:	learn: 3059.3199367	total: 1.59s	remaining: 5.05s
72:	learn: 3053.5898434	total: 1.61s	remaining: 5.02s
73:	learn: 3046.7184438	total: 1.63s	remaining: 4.98s
74:	learn: 3044.4028855	total: 1.65s	remaining: 4.95s
75:	learn: 3036.5127718	total: 1.67s	remaining: 4.92s
76:	learn: 3032.6835102	total: 1.69s	remaining: 4.89s
77:	learn: 3026.6082409	total: 1.71s	remaining: 4.88s
78:	learn: 3015.8495558	total: 1.74s	remaining: 4.86s
79:	learn: 3012.6783052	total: 1.76s	remaining: 4.83s
80:	learn: 3008.2269548	total: 1.77s	remaining: 4.8s
81:	learn: 3002.1445406	total: 1.79s	remaining: 4.77s
82:	learn: 2998.6921865	total: 1.81s	remaining: 4.74s
83:	learn: 2991.6592872	total: 1.83s	remaining: 4.71s
84:	learn: 2983.0559012	total: 1.85s	remaining: 4.69s
85:	learn: 2978.8719046	total: 1.87s	remaining: 4.66s
86:	learn: 2975.1527962	total: 1.89s	remaining: 4.63s
87:	learn: 2968.5163570	total: 1.91s	remaining: 4.61s
88:	learn: 2964.2521420	total: 1.94s	remaining: 4.59s
89:	learn: 2960.0459732	total: 1.96s	remaining: 4.57s
90:	learn: 2954.5277481	total: 1.98s	remaining: 4.54s
91:	learn: 2950.2775196	total: 2s	remaining: 4.51s
92:	learn: 2946.7894058	total: 2.02s	remaining: 4.49s
93:	learn: 2942.6919742	total: 2.04s	remaining: 4.47s
94:	learn: 2938.1521919	total: 2.06s	remaining: 4.44s
95:	learn: 2934.0822594	total: 2.07s	remaining: 4.41s
96:	learn: 2926.2332088	total: 2.09s	remaining: 4.38s
97:	learn: 2922.7756868	total: 2.11s	remaining: 4.35s
98:	learn: 2912.7873207	total: 2.13s	remaining: 4.33s
99:	learn: 2910.9889520	total: 2.15s	remaining: 4.3s
100:	learn: 2907.4645973	total: 2.17s	remaining: 4.28s
101:	learn: 2898.9339915	total: 2.19s	remaining: 4.26s
102:	learn: 2896.0940306	total: 2.21s	remaining: 4.23s
103:	learn: 2888.9555715	total: 2.23s	remaining: 4.21s
104:	learn: 2884.3369760	total: 2.25s	remaining: 4.18s
105:	learn: 2876.7747319	total: 2.27s	remaining: 4.16s
106:	learn: 2871.5527228	total: 2.29s	remaining: 4.13s
107:	learn: 2867.7893712	total: 2.31s	remaining: 4.11s
108:	learn: 2864.0985664	total: 2.37s	remaining: 4.15s
109:	learn: 2859.0235025	total: 2.39s	remaining: 4.13s
110:	learn: 2856.1162041	total: 2.42s	remaining: 4.11s
111:	learn: 2852.3310850	total: 2.43s	remaining: 4.09s
112:	learn: 2846.0348800	total: 2.46s	remaining: 4.07s
113:	learn: 2842.5079572	total: 2.48s	remaining: 4.04s
114:	learn: 2838.7404528	total: 2.5s	remaining: 4.02s
115:	learn: 2835.7568033	total: 2.52s	remaining: 3.99s
116:	learn: 2832.3121783	total: 2.54s	remaining: 3.97s
117:	learn: 2828.2519342	total: 2.56s	remaining: 3.94s
118:	learn: 2824.9035127	total: 2.58s	remaining: 3.92s
119:	learn: 2819.7277274	total: 2.6s	remaining: 3.9s
120:	learn: 2815.5542269	total: 2.62s	remaining: 3.88s
121:	learn: 2811.6318899	total: 2.64s	remaining: 3.85s
122:	learn: 2806.3458971	total: 2.65s	remaining: 3.82s
123:	learn: 2804.1917156	total: 2.67s	remaining: 3.8s
124:	learn: 2800.0994551	total: 2.69s	remaining: 3.77s
125:	learn: 2796.9184052	total: 2.71s	remaining: 3.75s
126:	learn: 2790.4601986	total: 2.73s	remaining: 3.72s
127:	learn: 2789.5690580	total: 2.75s	remaining: 3.7s
128:	learn: 2788.4653367	total: 2.77s	remaining: 3.67s
129:	learn: 2783.5127232	total: 2.79s	remaining: 3.65s
130:	learn: 2779.9807156	total: 2.81s	remaining: 3.63s
131:	learn: 2777.1546283	total: 2.83s	remaining: 3.61s
132:	learn: 2774.3216541	total: 2.86s	remaining: 3.59s
133:	learn: 2770.1664251	total: 2.88s	remaining: 3.56s
134:	learn: 2763.5489993	total: 2.89s	remaining: 3.54s
135:	learn: 2760.8648016	total: 2.91s	remaining: 3.51s
136:	learn: 2759.5762493	total: 2.93s	remaining: 3.49s
137:	learn: 2756.5814858	total: 2.95s	remaining: 3.46s
138:	learn: 2753.1751000	total: 2.97s	remaining: 3.44s
139:	learn: 2749.3518160	total: 2.99s	remaining: 3.41s
140:	learn: 2745.8234843	total: 3.01s	remaining: 3.4s
141:	learn: 2739.0352353	total: 3.03s	remaining: 3.37s
142:	learn: 2735.6354369	total: 3.05s	remaining: 3.35s
143:	learn: 2729.7815592	total: 3.07s	remaining: 3.33s
144:	learn: 2725.8207388	total: 3.09s	remaining: 3.31s
145:	learn: 2719.6153255	total: 3.11s	remaining: 3.28s
146:	learn: 2715.7367810	total: 3.13s	remaining: 3.26s
147:	learn: 2712.6892948	total: 3.15s	remaining: 3.24s
148:	learn: 2708.0471969	total: 3.17s	remaining: 3.21s
149:	learn: 2703.8629867	total: 3.19s	remaining: 3.19s
150:	learn: 2700.0992582	total: 3.21s	remaining: 3.17s
151:	learn: 2698.3763745	total: 3.23s	remaining: 3.15s
152:	learn: 2695.4965291	total: 3.26s	remaining: 3.13s
153:	learn: 2693.7759263	total: 3.28s	remaining: 3.11s
154:	learn: 2692.0653007	total: 3.3s	remaining: 3.08s
155:	learn: 2690.4429784	total: 3.31s	remaining: 3.06s
156:	learn: 2683.0178345	total: 3.33s	remaining: 3.04s
157:	learn: 2680.3969856	total: 3.35s	remaining: 3.01s
158:	learn: 2678.8322531	total: 3.37s	remaining: 2.99s
159:	learn: 2674.7737978	total: 3.39s	remaining: 2.97s
160:	learn: 2670.8520986	total: 3.41s	remaining: 2.94s
161:	learn: 2666.5752844	total: 3.43s	remaining: 2.92s
162:	learn: 2662.0612348	total: 3.45s	remaining: 2.9s
163:	learn: 2658.9172892	total: 3.47s	remaining: 2.88s
164:	learn: 2656.3127579	total: 3.49s	remaining: 2.85s
165:	learn: 2654.2956245	total: 3.51s	remaining: 2.83s
166:	learn: 2650.8342006	total: 3.52s	remaining: 2.81s
167:	learn: 2647.8452631	total: 3.54s	remaining: 2.79s
168:	learn: 2645.4351021	total: 3.56s	remaining: 2.76s
169:	learn: 2643.4199372	total: 3.58s	remaining: 2.74s
170:	learn: 2637.8088384	total: 3.6s	remaining: 2.72s
171:	learn: 2634.8963781	total: 3.62s	remaining: 2.7s
172:	learn: 2631.4148077	total: 3.64s	remaining: 2.67s
173:	learn: 2628.9342179	total: 3.67s	remaining: 2.65s
174:	learn: 2626.9344642	total: 3.69s	remaining: 2.63s
175:	learn: 2624.1794277	total: 3.71s	remaining: 2.61s
176:	learn: 2621.0677407	total: 3.72s	remaining: 2.59s
177:	learn: 2619.0864622	total: 3.75s	remaining: 2.57s
178:	learn: 2616.4262417	total: 3.77s	remaining: 2.54s
179:	learn: 2611.8023591	total: 3.79s	remaining: 2.52s
180:	learn: 2607.4758502	total: 3.8s	remaining: 2.5s
181:	learn: 2604.8386560	total: 3.83s	remaining: 2.48s
182:	learn: 2602.7118129	total: 3.84s	remaining: 2.46s
183:	learn: 2600.0099984	total: 3.87s	remaining: 2.44s
184:	learn: 2595.3616868	total: 3.89s	remaining: 2.42s
185:	learn: 2590.7486350	total: 3.91s	remaining: 2.4s
186:	learn: 2588.7296346	total: 3.93s	remaining: 2.38s
187:	learn: 2585.1133526	total: 3.95s	remaining: 2.35s
188:	learn: 2583.6693594	total: 3.97s	remaining: 2.33s
189:	learn: 2581.2657345	total: 3.98s	remaining: 2.31s
190:	learn: 2579.2343592	total: 4.01s	remaining: 2.29s
191:	learn: 2574.2903214	total: 4.03s	remaining: 2.26s
192:	learn: 2571.4414223	total: 4.04s	remaining: 2.24s
193:	learn: 2568.8347581	total: 4.06s	remaining: 2.22s
194:	learn: 2566.4013266	total: 4.08s	remaining: 2.2s
195:	learn: 2563.6033367	total: 4.1s	remaining: 2.18s
196:	learn: 2559.6417802	total: 4.12s	remaining: 2.16s
197:	learn: 2556.6837746	total: 4.14s	remaining: 2.13s
198:	learn: 2553.7559952	total: 4.16s	remaining: 2.11s
199:	learn: 2550.2036534	total: 4.18s	remaining: 2.09s
200:	learn: 2548.9971420	total: 4.2s	remaining: 2.07s
201:	learn: 2546.8813073	total: 4.22s	remaining: 2.05s
202:	learn: 2542.8649668	total: 4.24s	remaining: 2.02s
203:	learn: 2540.2543083	total: 4.26s	remaining: 2s
204:	learn: 2538.1358868	total: 4.28s	remaining: 1.98s
205:	learn: 2536.2642955	total: 4.3s	remaining: 1.96s
206:	learn: 2533.9119102	total: 4.32s	remaining: 1.94s
207:	learn: 2528.1194337	total: 4.34s	remaining: 1.92s
208:	learn: 2525.6137601	total: 4.36s	remaining: 1.9s
209:	learn: 2524.0504020	total: 4.38s	remaining: 1.88s
210:	learn: 2520.2937581	total: 4.4s	remaining: 1.85s
211:	learn: 2518.0978872	total: 4.42s	remaining: 1.83s
212:	learn: 2515.0674585	total: 4.44s	remaining: 1.81s
213:	learn: 2512.6230409	total: 4.46s	remaining: 1.79s
214:	learn: 2509.6020012	total: 4.48s	remaining: 1.77s
215:	learn: 2507.8875596	total: 4.5s	remaining: 1.75s
216:	learn: 2505.9268766	total: 4.51s	remaining: 1.73s
217:	learn: 2504.9292868	total: 4.56s	remaining: 1.71s
218:	learn: 2503.1386164	total: 4.58s	remaining: 1.69s
219:	learn: 2501.0653685	total: 4.59s	remaining: 1.67s
220:	learn: 2498.9974693	total: 4.61s	remaining: 1.65s
221:	learn: 2496.0286586	total: 4.63s	remaining: 1.63s
222:	learn: 2494.2660921	total: 4.65s	remaining: 1.61s
223:	learn: 2492.2409663	total: 4.67s	remaining: 1.58s
224:	learn: 2490.3975768	total: 4.69s	remaining: 1.56s
225:	learn: 2487.8069637	total: 4.71s	remaining: 1.54s
226:	learn: 2484.6277515	total: 4.73s	remaining: 1.52s
227:	learn: 2482.8896285	total: 4.76s	remaining: 1.5s
228:	learn: 2479.3549975	total: 4.78s	remaining: 1.48s
229:	learn: 2475.7218650	total: 4.8s	remaining: 1.46s
230:	learn: 2474.8583222	total: 4.81s	remaining: 1.44s
231:	learn: 2473.3566172	total: 4.83s	remaining: 1.42s
232:	learn: 2470.3861361	total: 4.85s	remaining: 1.4s
233:	learn: 2467.9915732	total: 4.87s	remaining: 1.37s
234:	learn: 2465.3891239	total: 4.89s	remaining: 1.35s
235:	learn: 2462.8645457	total: 4.91s	remaining: 1.33s
236:	learn: 2459.6978387	total: 4.93s	remaining: 1.31s
237:	learn: 2458.3786441	total: 4.95s	remaining: 1.29s
238:	learn: 2456.3833809	total: 4.98s	remaining: 1.27s
239:	learn: 2454.4339435	total: 5s	remaining: 1.25s
240:	learn: 2451.7606019	total: 5.01s	remaining: 1.23s
241:	learn: 2449.1438891	total: 5.03s	remaining: 1.21s
242:	learn: 2447.9914282	total: 5.05s	remaining: 1.18s
243:	learn: 2444.9214230	total: 5.07s	remaining: 1.16s
244:	learn: 2440.7150517	total: 5.09s	remaining: 1.14s
245:	learn: 2436.5246796	total: 5.11s	remaining: 1.12s
246:	learn: 2433.9491893	total: 5.13s	remaining: 1.1s
247:	learn: 2431.5627637	total: 5.15s	remaining: 1.08s
248:	learn: 2430.0273642	total: 5.17s	remaining: 1.06s
249:	learn: 2428.7527475	total: 5.19s	remaining: 1.04s
250:	learn: 2424.5047895	total: 5.21s	remaining: 1.02s
251:	learn: 2423.5021186	total: 5.23s	remaining: 996ms
252:	learn: 2422.0652624	total: 5.25s	remaining: 975ms
253:	learn: 2418.1207629	total: 5.27s	remaining: 955ms
254:	learn: 2417.0179021	total: 5.29s	remaining: 934ms
255:	learn: 2415.4887237	total: 5.31s	remaining: 912ms
256:	learn: 2413.6604080	total: 5.33s	remaining: 891ms
257:	learn: 2412.6497126	total: 5.35s	remaining: 871ms
258:	learn: 2410.6288402	total: 5.37s	remaining: 850ms
259:	learn: 2409.1988976	total: 5.39s	remaining: 829ms
260:	learn: 2407.6792740	total: 5.41s	remaining: 808ms
261:	learn: 2404.9377250	total: 5.42s	remaining: 787ms
262:	learn: 2400.8398693	total: 5.45s	remaining: 767ms
263:	learn: 2395.9763618	total: 5.47s	remaining: 746ms
264:	learn: 2391.4009628	total: 5.49s	remaining: 725ms
265:	learn: 2389.9816519	total: 5.51s	remaining: 704ms
266:	learn: 2385.9111771	total: 5.53s	remaining: 683ms
267:	learn: 2383.9772362	total: 5.55s	remaining: 662ms
268:	learn: 2381.3434401	total: 5.57s	remaining: 641ms
269:	learn: 2378.7573890	total: 5.59s	remaining: 621ms
270:	learn: 2376.5264175	total: 5.61s	remaining: 600ms
271:	learn: 2375.1055527	total: 5.63s	remaining: 580ms
272:	learn: 2373.1316252	total: 5.65s	remaining: 559ms
273:	learn: 2372.5697080	total: 5.67s	remaining: 538ms
274:	learn: 2371.7149805	total: 5.69s	remaining: 517ms
275:	learn: 2370.4321563	total: 5.71s	remaining: 496ms
276:	learn: 2369.6868366	total: 5.73s	remaining: 476ms
277:	learn: 2366.5589271	total: 5.75s	remaining: 455ms
278:	learn: 2364.5949262	total: 5.77s	remaining: 434ms
279:	learn: 2362.7275188	total: 5.79s	remaining: 413ms
280:	learn: 2359.4790672	total: 5.81s	remaining: 393ms
281:	learn: 2356.9582384	total: 5.83s	remaining: 372ms
282:	learn: 2355.2890264	total: 5.85s	remaining: 352ms
283:	learn: 2353.2774474	total: 5.87s	remaining: 331ms
284:	learn: 2350.3137395	total: 5.89s	remaining: 310ms
285:	learn: 2349.1723162	total: 5.91s	remaining: 289ms
286:	learn: 2347.0432612	total: 5.93s	remaining: 269ms
287:	learn: 2344.5699555	total: 5.95s	remaining: 248ms
288:	learn: 2343.9683648	total: 5.97s	remaining: 227ms
289:	learn: 2342.3714576	total: 5.98s	remaining: 206ms
290:	learn: 2340.2806260	total: 6s	remaining: 186ms
291:	learn: 2338.6507973	total: 6.02s	remaining: 165ms
292:	learn: 2336.4315345	total: 6.04s	remaining: 144ms
293:	learn: 2335.1754544	total: 6.07s	remaining: 124ms
294:	learn: 2333.8364553	total: 6.09s	remaining: 103ms
295:	learn: 2332.1175752	total: 6.11s	remaining: 82.5ms
296:	learn: 2330.2244001	total: 6.12s	remaining: 61.9ms
297:	learn: 2328.8042956	total: 6.14s	remaining: 41.2ms
298:	learn: 2327.0888459	total: 6.16s	remaining: 20.6ms
299:	learn: 2323.2451682	total: 6.18s	remaining: 0us
Mejor selección:  <catboost.core.CatBoostRegressor object at 0x000002453A0B7B80>
Score train:  0.8863684085128496
In [83]:
print('score test: ', gs_cb_reg.score (X_test, y_test))
score test:  0.8865574832900184
In [84]:
mse = mean_squared_error(y_true=y_test,
                         y_pred=gs_cb_reg.predict(X_test))

rmse = np.sqrt(mse)

print("Catboost Regressor tiene un RMSE en el conjunto de test de "  +  str(rmse))
Catboost Regressor tiene un RMSE en el conjunto de test de 3216.17001774777
In [85]:
grid_xgb_reg = {"n_estimators": [100, 300],
                "max_depth": [3, 9, 11],
                "subsample": [0.03, 0.1, 0.5],
                "learning_rate": [0.03, 0.1],
                "colsample_bytree": [0.3, 0.7]
               }

# GridSearch: 
gs_xgb_reg = GridSearchCV(estimator = xgb_reg,
                        param_grid = grid_xgb_reg,
                        scoring    = 'r2',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)

gs_xgb_reg.fit(X_train, y_train)

print('Mejor selección: ', gs_xgb_reg.best_estimator_)
print('Score train: ', gs_xgb_reg.best_score_, 2)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
Mejor selección:  XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=11, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=300, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=42, ...)
Score train:  0.8950928104957875 2
In [86]:
from sklearn.metrics import r2_score

r2 = r2_score(y_true=y_test,
                         y_pred=gs_xgb_reg .predict(X_test))

mse = mean_squared_error(y_true=y_test,
                         y_pred=gs_xgb_reg.predict(X_test))

rmse = np.sqrt(mse)

print("El modelo XGBoost Regressor tiene un R2 en el conjunto de test de "  +  str(r2))
print("El modelo XGBoost Regressor tiene un RMSE en el conjunto de test de "  +  str(rmse))
El modelo XGBoost Regressor tiene un R2 en el conjunto de test de 0.8971292992427436
El modelo XGBoost Regressor tiene un RMSE en el conjunto de test de 3062.6468500761334
In [87]:
conjunto_de_GridSearchs = [gs_ab_reg ,
                            gs_gb_reg ,
                            gs_lgb_reg ,
                            gs_cb_reg ,
                            gs_xgb_reg ]
In [88]:
best_score = [gs_ab_reg.best_score_, gs_gb_reg.best_score_, gs_lgb_reg.best_score_, gs_cb_reg.best_score_,
              gs_xgb_reg.best_score_]
In [89]:
best_score
#XGBoost Regressor.
Out[89]:
[0.5308523379539066,
 0.8826310919750598,
 0.8815811343287233,
 0.8863684085128496,
 0.8950928104957875]
In [90]:
mejor_puntuación_GridSearch = conjunto_de_GridSearchs[4].best_estimator_
mejor_puntuación_GridSearch
Out[90]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=11, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=300, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=42, ...)
Plot the importance of the 10 most important features for the best model you find. Which are those features? Are they the same and in the same proportion than for `dt`? Why/why not?
In [91]:
gs_xgb_reg = gs_xgb_reg.best_estimator_

gs_xgb_reg.feature_importances_
Out[91]:
array([0.03375672, 0.01227906, 0.00611096, 0.00232226, 0.00749108,
       0.0086685 , 0.00890077, 0.00258356, 0.00997281, 0.0057821 ,
       0.00541655, 0.00863947, 0.0071637 , 0.00497436, 0.00584285,
       0.00225383, 0.00507726, 0.00586527, 0.00382546, 0.00896826,
       0.00882643, 0.00705131, 0.0015811 , 0.01504467, 0.00641449,
       0.00246683, 0.00970755, 0.01240706, 0.00198961, 0.00934782,
       0.0074902 , 0.00280123, 0.00522738, 0.00567962, 0.01483084,
       0.00396883, 0.00428546, 0.0295811 , 0.00953096, 0.00407492,
       0.00774525, 0.00310166, 0.01081623, 0.0040716 , 0.00367444,
       0.00620251, 0.00837924, 0.02075411, 0.01137489, 0.00929384,
       0.06406159, 0.00265261, 0.02235937, 0.08456504, 0.00649089,
       0.06493517, 0.0038346 , 0.027018  , 0.00761793, 0.00659274,
       0.00420821, 0.00419161, 0.00172993, 0.00273793, 0.00705998,
       0.00626196, 0.0047779 , 0.00672948, 0.00997643, 0.01427866,
       0.08461848, 0.00823848, 0.00791812, 0.01320508, 0.00768335,
       0.00474341, 0.00507872, 0.00465267, 0.01084319, 0.00522084,
       0.00981415, 0.01178969, 0.02002072, 0.00870402, 0.0029592 ,
       0.00309086, 0.00249973, 0.00271275, 0.00467182, 0.00298751,
       0.00227844, 0.00488396, 0.00278622, 0.00270417, 0.00292249,
       0.00386328, 0.00341268], dtype=float32)
In [92]:
sort = gs_xgb_reg.feature_importances_.argsort()
In [93]:
features_importance = pd.DataFrame(columns = X_train.columns[sort])
features_importance
Out[93]:
manufacturer_land rover title_status_missing manufacturer_mini manufacturer_harley-davidson paint_color_grey manufacturer_alfa-romeo manufacturer_mazda paint_color_blue manufacturer_buick cylinders_5 ... type_truck cylinders_10 cylinders_6 fuel_gas manufacturer_tesla age cylinders_4 fuel_diesel cylinders_8 drive_fwd

0 rows × 97 columns

In [94]:
features_importance.loc[0] = gs_xgb_reg.feature_importances_[sort]
features_importance
Out[94]:
manufacturer_land rover title_status_missing manufacturer_mini manufacturer_harley-davidson paint_color_grey manufacturer_alfa-romeo manufacturer_mazda paint_color_blue manufacturer_buick cylinders_5 ... type_truck cylinders_10 cylinders_6 fuel_gas manufacturer_tesla age cylinders_4 fuel_diesel cylinders_8 drive_fwd
0 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.003 0.003 ... 0.020 0.021 0.022 0.027 0.030 0.034 0.064 0.065 0.085 0.085

1 rows × 97 columns

In [95]:
Top_fi = features_importance.drop(features_importance.columns[0: 87], axis='columns')
Top_fi
Out[95]:
type_truck cylinders_10 cylinders_6 fuel_gas manufacturer_tesla age cylinders_4 fuel_diesel cylinders_8 drive_fwd
0 0.020 0.021 0.022 0.027 0.030 0.034 0.064 0.065 0.085 0.085
In [96]:
Top_fi.plot.bar(fontsize = 12, rot=35,
                title = 'Top XGBoost Features According to Their Importance',
                xlabel = 'XGBoost Regressor Features',
                ylabel = 'Importancia'
               )
Out[96]:
Compute the predictions for your best model, and plot them against the real prices in `Y_test`. Are you surprised by the results? Do the same for `Y_train` and see if the behavior has degraded considerably or not.
In [97]:
predXgboost_train = gs_xgb_reg.predict(X_train)

predXgboost_train
Out[97]:
array([19479.406 ,  7969.9023,  7970.1777, ...,  3756.1501,  6835.316 ,
       39225.723 ], dtype=float32)
In [98]:
predXgboost_train_df = X_train
In [99]:
predXgboost_train_df["predXgboost_train"] = predXgboost_train
In [100]:
predXgboost_train_df.head()
Out[100]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_train
29645 4 72500 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 1 0 19479.406
68355 17 136000 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 7969.902
52304 8 109526 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 7970.178
62400 14 184850 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 5868.347
11259 4 70000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 30769.883

5 rows × 98 columns

In [101]:
predXgboost_train_df['price'] = y_train
In [102]:
predXgboost_train_df.head()
Out[102]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_train price
29645 4 72500 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 1 0 19479.406 24495
68355 17 136000 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 7969.902 6500
52304 8 109526 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 7970.178 7300
62400 14 184850 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 5868.347 3995
11259 4 70000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 30769.883 31900

5 rows × 99 columns

In [103]:
predXgboost_train_df["diferencia"] = predXgboost_train_df["predXgboost_train"] - y_train
In [104]:
predXgboost_train_df = predXgboost_train_df.sort_values(by = "price", ascending = False)
In [105]:
predXgboost_train_df
Out[105]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_train price diferencia
72777 0 4700 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 49176.090 50000 -823.910
21483 4 57100 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 47417.379 50000 -2582.621
66009 4 22500 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 46287.512 50000 -3712.488
71181 1 7900 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 47532.562 50000 -2467.438
55346 5 2236 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 46100.117 50000 -3899.883
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47868 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
47918 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
14672 17 192000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 4915.553 500 4415.553
26146 21 140007 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 1 0 749.533 500 249.533
38105 19 197000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 750.442 500 250.442

62304 rows × 100 columns

In [106]:
from bokeh.io import output_notebook, show
output_notebook()

from bokeh.plotting import Figure
Loading BokehJS ...
In [126]:
train = Figure(title = 'TRAIN : Precio real vs Predicciones', width=800, height=400, tools=['pan',
                               'xwheel_zoom',   
                               'hover',
                               'reset', 
                               'help'])
train.scatter(x = predXgboost_train, y = y_train, size=10, fill_color='#DA1313')

train.yaxis.axis_label='Predicciones'
train.xaxis.axis_label='Real'
show(train)
In [108]:
predXgboost_train_df["diferencia"].mad()
Out[108]:
1195.1600614702165
In [109]:
predXgboost_test = gs_xgb_reg.predict(x_test)

predXgboost_test
Out[109]:
array([ 4351.176, 11426.97 , 27498.592, ..., 11860.888, 16020.244,
        7016.037], dtype=float32)
In [110]:
predXgboost_test_df = x_test
In [111]:
predXgboost_test_df['predXgboost_test'] = predXgboost_test
In [112]:
predXgboost_test_df.head()
Out[112]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_custom paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test
27026 11 159427 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 4351.176
31692 10 153164 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 11426.970
1520 5 151318 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 27498.592
59465 7 75700 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 13862.031
67441 17 191588 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 4214.032

5 rows × 98 columns

In [113]:
predXgboost_test_df['price'] = y_test
In [114]:
predXgboost_test_df.head()
Out[114]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_green paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price
27026 11 159427 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 4351.176 5890
31692 10 153164 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 11426.970 9995
1520 5 151318 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 27498.592 23999
59465 7 75700 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 13862.031 11900
67441 17 191588 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 4214.032 3800

5 rows × 99 columns

In [115]:
predXgboost_test_df['diferencia'] = predXgboost_test_df['predXgboost_test'] - y_test
In [116]:
predXgboost_test_df = predXgboost_test_df.sort_values(by ='price', ascending = False)
In [117]:
predXgboost_test_df
Out[117]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
53007 0 2315 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 35676.855 50000 -14323.145
42034 0 1800 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42111.445 50000 -7888.555
63242 4 136600 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42574.672 50000 -7425.328
70819 3 10500 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 30308.906 50000 -19691.094
25341 2 47000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 32528.229 49999 -17470.771
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48518 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
41212 2 2520 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 31718.785 500 31218.785
74640 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
68171 18 205000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2910.247 500 2410.247
36515 17 160000 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 930.583 500 430.583

15576 rows × 100 columns

In [118]:
predXgboost_test_df.reset_index(drop=True, inplace=True)
predXgboost_test_df
Out[118]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
0 0 2315 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 35676.855 50000 -14323.145
1 0 1800 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42111.445 50000 -7888.555
2 4 136600 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42574.672 50000 -7425.328
3 3 10500 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 30308.906 50000 -19691.094
4 2 47000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 32528.229 49999 -17470.771
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15571 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
15572 2 2520 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 31718.785 500 31218.785
15573 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
15574 18 205000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2910.247 500 2410.247
15575 17 160000 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 930.583 500 430.583

15576 rows × 100 columns

In [127]:
test = Figure(title = 'TEST : Precio real vs Predicciones', width=800, height=400, tools=['pan',
                               'xwheel_zoom',   
                               'hover',
                               'reset', 
                               'help'])
test.scatter(x = predXgboost_test, y = y_test, size=10, fill_color='#13DA2B')

test.yaxis.axis_label='Predicciones'
test.xaxis.axis_label='Real'
show(test)
#Positive correlation
In [120]:
predXgboost_test_df["diferencia"].mad() 
Out[120]:
1842.4206098594052
In [121]:
round(predXgboost_test_df["diferencia"].mad() - predXgboost_train_df["diferencia"].mad(), 2)
Out[121]:
647.26
Find some examples of the following: 1. Cars whose price are very underestimated by the model. 2. Cars whose price are very overestimated by the model. 3. Cars whose price are almost predicted exactly by the model. Can you figure out the reasons behind 1 and 2?
In [122]:
predXgboost_test_df
Out[122]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
0 0 2315 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 35676.855 50000 -14323.145
1 0 1800 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42111.445 50000 -7888.555
2 4 136600 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42574.672 50000 -7425.328
3 3 10500 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 30308.906 50000 -19691.094
4 2 47000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 32528.229 49999 -17470.771
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15571 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
15572 2 2520 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 31718.785 500 31218.785
15573 7 44995 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2319.616 500 1819.616
15574 18 205000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 2910.247 500 2410.247
15575 17 160000 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 930.583 500 430.583

15576 rows × 100 columns

In [128]:
infra_df= predXgboost_test_df [predXgboost_test_df ['diferencia'] < -3062.64]  
                               
infra_df
Out[128]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
0 0 2315 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 35676.855 50000 -14323.145
1 0 1800 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42111.445 50000 -7888.555
2 4 136600 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 42574.672 50000 -7425.328
3 3 10500 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 30308.906 50000 -19691.094
4 2 47000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 32528.229 49999 -17470.771
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11623 11 181626 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1892.651 5500 -3607.349
11633 21 214000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 1802.560 5500 -3697.440
12073 21 250000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 1874.540 5000 -3125.460
12320 28 85235 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 701.932 4990 -4288.068
12838 21 214036 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1083.462 4500 -3416.538

1268 rows × 100 columns

In [129]:
sobre_df= predXgboost_test_df [predXgboost_test_df ['diferencia'] > 3062.64]  
                               
sobre_df
Out[129]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
48 2 35499 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 52207.832 46995 5212.832
137 3 39665 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 51740.793 44750 6990.793
200 2 55259 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 46726.535 42900 3826.535
211 2 40000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 47502.191 42500 5002.191
424 1 21000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 42089.918 38000 4089.918
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15547 3 23829 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 34347.172 507 33840.172
15548 2 19757 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 34656.332 507 34149.332
15552 1 41970 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 15280.449 500 14780.449
15558 29 200000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 5175.416 500 4675.416
15572 2 2520 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 31718.785 500 31218.785

1341 rows × 100 columns

In [130]:
casi_df = predXgboost_test_df [(predXgboost_test_df ['diferencia'] > 0) & (predXgboost_test_df ['diferencia'] < 100)]
                               
casi_df
Out[130]:
age odometer manufacturer_acura manufacturer_alfa-romeo manufacturer_aston-martin manufacturer_audi manufacturer_bmw manufacturer_buick manufacturer_cadillac manufacturer_chevrolet ... paint_color_grey paint_color_orange paint_color_purple paint_color_red paint_color_silver paint_color_white paint_color_yellow predXgboost_test price diferencia
85 2 165053 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 45939.434 45900 39.434
86 2 165053 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 45939.434 45900 39.434
87 2 165053 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 45939.434 45900 39.434
89 2 165053 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 45939.434 45900 39.434
158 2 31216 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 44067.941 43990 77.941
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15268 14 178000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1594.271 1500 94.271
15274 20 233449 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 1576.058 1500 76.058
15294 15 260500 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1446.109 1400 46.109
15388 20 210000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1051.914 1000 51.914
15448 2 41000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 1010.621 999 11.621

420 rows × 100 columns

The above missing strategy is too restrictive. There're some columns that need to be there as we have no way of inferring `price`, `year`, `odometer`, `condition`, `title_status` and `paint_color`, but there's another column in the full CSV version (`model`, the specific model of the car), from which we can infer some specific information. That is, once we know `model`, we can try to infer from the rest of the dataset its `manufacturer`, `cylinders`, `fuel`, `transmission`, `drive` and `type`. For example, if we find `model='corsa 1.4d'` we can infer that `manufacturer='opel'` and `fuel='diesel'`). Can you implement this strategy? How many more rows do you obtain now? Repeat the train/test split and the model training with this extended dataset. Can you improve on previous results? By how much?
Besides `model`, there are yet more features in that full CSV file that can be loaded: * `url`: URL with the original second-hand car ad. * `image_url`: URL with a photo of the car. * `region`: region in the United States where the car was being sold. * `region_url`: URL with Craigslist link of that region. * `vin`: Vehicle Identification Number of the car. * `description`: the text on the car ad itself (kind of free text). * `county`: US county where the car was being sold. * `state`: US state where the car was being sold. * `lat`: exact latitude of the car. * `lon`: exact longitude of the car. Which of these would you use to build better models? Try to use any of these and see if you can further improve (more points the more you try and the more you improve!).